Week 1: Review!

PS 818 - Statistical Models

Anton Strezhnev

University of Wisconsin-Madison

September 3, 2025

Welcome!

\[ \require{cancel} \DeclareMathOperator*{\argmin}{arg\,min} \]

\[ \DeclareMathOperator*{\argmax}{arg\,max} \]

Course Overview

  • Instructor: Anton Strezhnev

  • Logistics:

    • Lectures Mon/Wed, 9:30am - 10:45am
    • 4 Problem Sets (~ 2 weeks)
    • Midterm (~ 1 week)
    • In-person final (TBA)
    • Office hours: Just drop in!
  • What is this course about?

    • Defining statistical models via their data-generating process
    • Estimating model parameters and conducting inference
    • Interpreting model output and evaluating model quality

Course objectives

  • Give you the tools you need to understand descriptive inference via statistical models and comment on other researchers’ work.
  • Equip you with an understanding of the fundamentals of likelihood and Bayesian inference to enable you to learn new models that build on these principles.
  • Connect these principles to the particular research questions that you want to answer.
  • Teach you how to program and implement estimators by yourself!

Course workflow

  • Lectures
    • Topics organized by week
    • Lectures are the “course notes” – readings are the reference manuals.
  • Readings
    • Mix of textbooks and papers
    • All readings available digitally on Canvas

Course workflow

  • Problem sets (25% of your grade)
    • Meant as a check on your understanding of the material and a way of communicating with me about the course
    • Collaboration is strongly encouraged – you should ask and answer questions on our Ed discussion board.
    • Graded holistically on a plus/check/minus system.

Course workflow

  • Midterm and Final exam (25% and 40% of your grade)
    • The midterm exam will be structured like the problem sets with two main differences:
      • You have about 1 week to complete it instead of 2
      • You may not collaborate with one another
    • The final exam will be a written in-person exam
      • Slightly more theory heavy, but some questions will require you to analyze code + output.
  • Participation (10% of your grade)
    • It is important that you actively engage with lecture and section – ask and answer questions.
    • Do the reading!
    • Participating on the discussion board counts towards this as well.

Assignment Timeline

  • Problem Set 1: Assigned September 9, Due September 22
  • Problem Set 2: Assigned September 29, Due October 13
  • Midterm Exam Assigned October 14, Due October 20 (1 week)
  • Problem Set 3: Assigned October 28, Due November 10
  • Problem Set 4: Assigned November 11, Due December 1 (Extra time due to Thanksgiving)
  • Final Exam: TBA (whenever/wherever I can book a room)

Class Requirements

  • Overall: An interest in learning and willingness to ask questions.

  • Assume a background in intro probability and statistics (1st year sequence)

    • You should be comfortable thinking about basic estimands/estimators + their properties
    • You should be able to interpret a confidence interval for (e.g.) a difference-in-means.
  • Some prior knowledge of causal inference helpful but not critical

    • We’ll be connecting predictive models to causal estimands
    • Ideally should be familiar with the potential outcomes framework
  • You should also be familiar with linear regression

    • \(\hat{\beta} = (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}Y\) should be a familiar expression
    • You should know under what conditions it’s unbiased for \(\mathbb{E}[Y|X]\), and under what conditions it’s efficient.
  • If you want some review, check out chapters 1-6 of “Regression and Other Stories”

A brief overview

  • Week 2-4: Introduction to likelihood inference and GLMs

    • Concept of the likelihood, MLE as an estimator + asymptotic properties
    • Binary outcome models, count models, duration models
  • Week 5-7: Bayesian Inference and Multilevel Models

    • Principles of Bayesian inference – posteriors, priors, data
    • Quantities of interest: posterior means, credible intervals
    • Estimation via MCMC
    • Application to multilevel regression models
  • Week 8: Survey data

    • Applying multilevel regression methods to survey data
    • Survey weighting to address non-random sampling.
  • Week 9: Mixture Models and the EM algorithm

  • Week 10: Item response theory and ideal point models

  • Week 11-13: Flexible regression (ridge/lasso, forests, kernels)

  • Week 14: Semi-parametric theory

  • Week 15: Big regressions!

Estimation review

Random variables

  • Understanding the behavior and properties of random variables is at the core of statistical theory.
  • (Simply put) a random variable \(X\) is a mapping from a sample space to the real number line
    • Random variables have a distribution (which we may or may not assume we know) defined by the cumulative distribution function (CDF)
    \[F(x) = Pr(X \le x)\]

Random variables

  • Discrete random variables take on a countable number of values (e.g. Bernoulli r.v. can take on 0 or 1) and have a probability mass function (PMF)

    \[p(x) = Pr(X = x)\]

  • Continuous random variables take on an uncountable number of values (e.g. the Normal distribution on \((-\infty, \infty)\)).

    • No PMF, but have a density function (PDF) that integrates to a probability

    \[Pr(X \in \mathcal{A}) = \int_{\mathcal{A}} f(x)dx\]

Remember: PMFs (and PDFs) sum (integrate) to \(1\) over the support of the random variable.

Expectations

  • One important property of a random variable is its expectation \(\mathbb{E}[X]\). We’ll often make assumptions about the expectation of an R.V. while remaining agnostic about its true distribution.

    • The expectation is a weighted average. For a discrete r.v. \(X\), we sum over the support of the random variable \(\mathcal{X}\).

    \[\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x Pr(X = x)\]

  • For continuous r.v. we have an integral

    \[\mathbb{E}[X] = \int_{x \in \mathcal{X}} x f(x) dx\]

  • Fun fact: we can get the expectation of any function \(g(X)\) of \(X\) just by plugging it into the integral (the "law of the unconscious statistician")

    \[\mathbb{E}[g(X)] = \int_{x \in \mathcal{X}} g(x) f(x) dx\]
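This identity is easy to verify numerically. A minimal Monte Carlo sketch (in Python, with the hypothetical choices \(X \sim \text{Uniform}(0,1)\) and \(g(x) = x^2\), for which the integral gives exactly \(1/3\)):

```python
import random

# Monte Carlo check of E[g(X)] = integral of g(x) f(x) dx.
# Here X ~ Uniform(0, 1) and g(x) = x^2, so E[g(X)] = 1/3 exactly.
random.seed(42)
n = 200_000
draws = [random.random() ** 2 for _ in range(n)]  # g(X) evaluated at each draw
mc_estimate = sum(draws) / n

print(round(mc_estimate, 3))  # should be close to 1/3
```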

Expectations

  • You’ll need to know some essential properties of expectations to simplify certain problems

  • Most important: linearity. For any two random variables \(X\) and \(Y\) and constants \(a\) and \(b\)

    \[\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\]

  • Note that for any generic function \(g()\), \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\). If \(g()\) is convex, by Jensen’s inequality \(\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])\)

  • For a binary r.v. \(X \in \{0, 1\}\), it’s helpful to remember the “fundamental bridge” between expectations and probability

    \[\mathbb{E}[X] = Pr(X = 1)\]
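Both linearity and the fundamental bridge can be checked on simulated draws. A short Python sketch, with the arbitrary choices \(X \sim \text{Bernoulli}(0.3)\), \(Y \sim \text{Uniform}(0,1)\), \(a = 2\), \(b = -1\):

```python
import random

# Empirical check of linearity of expectations and the "fundamental bridge".
random.seed(0)
n = 200_000
xs = [1 if random.random() < 0.3 else 0 for _ in range(n)]  # Bernoulli(0.3)
ys = [random.random() for _ in range(n)]                    # Uniform(0, 1)

mean_x = sum(xs) / n  # estimates E[X] = Pr(X = 1) = 0.3 (fundamental bridge)
mean_y = sum(ys) / n  # estimates E[Y] = 0.5
lhs = sum(2 * x - y for x, y in zip(xs, ys)) / n  # estimates E[2X - Y]
rhs = 2 * mean_x - mean_y                         # 2 E[X] - E[Y]

print(round(lhs, 3), round(rhs, 3))  # the two agree: linearity holds
```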

Variance

  • We also care about the spread of a random variable – how far a typical draw of \(X\) is from its mean \(\mathbb{E}[X]\). One measure of this is the variance.

    \[Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\]

  • Also written as

\[Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\]

  • Note that the square is a convex function. Which means that by Jensen’s inequality \(\mathbb{E}[X^2] \ge \mathbb{E}[X]^2\). Variances cannot be negative!
  • We also can define a covariance between two variables (does \(X\) take high values when \(Y\) takes high values?)

\[Cov(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]
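Both the variance and covariance shortcut formulas can be verified on simulated data. A quick Python sketch, with the hypothetical setup \(X \sim \text{Uniform}(0,1)\) (so \(Var(X) = 1/12\)) and \(Y = X + \text{noise}\) (so \(Cov(X, Y) = Var(X)\)):

```python
import random

# Check Var(X) = E[X^2] - E[X]^2 and Cov(X, Y) = E[XY] - E[X]E[Y].
random.seed(1)
n = 200_000
xs = [random.random() for _ in range(n)]          # Uniform(0, 1)
ys = [x + random.gauss(0, 0.1) for x in xs]       # Y = X + independent noise

ex = sum(xs) / n
ex2 = sum(x * x for x in xs) / n
var_x = ex2 - ex ** 2                             # E[X^2] - E[X]^2

ey = sum(ys) / n
exy = sum(x * y for x, y in zip(xs, ys)) / n
cov_xy = exy - ex * ey                            # E[XY] - E[X]E[Y]

print(round(var_x, 3), round(cov_xy, 3))  # both near 1/12 ~ 0.083
```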

Variance

  • Variances also have some useful properties.
  • For a constant \(a\)

\[Var(aX) = a^2Var(X)\]

  • For any two random variables \(X\) and \(Y\)

\[Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\] \[Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y)\]

  • For independent random variables \(X\) and \(Y\)

\[Var(X + Y) = Var(X) + Var(Y)\] \[Var(X - Y) = Var(X) + Var(Y)\]
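The independence case is easy to see in simulation. A Python sketch with the arbitrary choices \(X \sim \mathcal{N}(0, 1)\) and \(Y \sim \mathcal{N}(0, 4)\) drawn independently, so \(Var(X + Y) = 5\):

```python
import random

# For independent X and Y, Var(X + Y) = Var(X) + Var(Y).
random.seed(2)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]  # Var(X) = 1
ys = [random.gauss(0, 2) for _ in range(n)]  # Var(Y) = 4, independent of X

def sample_var(zs):
    """Usual unbiased sample variance."""
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / (len(zs) - 1)

total = sample_var([x + y for x, y in zip(xs, ys)])
print(round(total, 2))  # near Var(X) + Var(Y) = 5
```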

Conditional probabilities

  • We will also spend a lot of time with conditional distributions and conditional expectations of random variables.
    • What’s the probability that an individual enrolls in a job training program given their income?
  • We represent the conditioning set using a vertical bar with the right-hand side denoting what is being conditioned on.
    • For example: \(Pr(D_i = 1 | X_i = x)\)

Conditional probabilities

  • Key concept - Dependence and independence. If two variables are independent, the distribution of one does not change conditional on the other. We’ll write this using the \(\perp \!\!\! \perp\) notation.

    • \(Y_i \perp \!\!\! \perp D_i\) implies

    \[f(Y_i | D_i = 1) = f(Y_i| D_i = 0) = f(Y_i)\]

  • Two variables can be conditionally independent in that they are independent only when conditioning on a third variable. For example, we can have \(Y_i \cancel{\perp \!\!\! \perp} D_i\) but \(Y_i \perp \!\!\! \perp D_i | X_i\). This implies

    \[f(Y_i| D_i = 1, X_i = x) = f(Y_i| D_i = 0, X_i = x) = f(Y_i | X_i =x)\]

  • Remember: Conditional independence does not imply independence or vice-versa!

Conditional expectations

  • A central object of interest in statistics is the conditional expectation function (CEF) \(\mathbb{E}[Y | X]\).

    • Given a particular value of \(X\), what is the expectation of \(Y\)?
    • The CEF is a function of \(X\).
  • All the usual properties of expectations apply to conditional expectations.

    • We also will often make use of the law of total expectation

    \[\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]]\]

  • Easiest to think about this in terms of discrete r.v.s

    \[\mathbb{E}[Y] = \sum_{x \in \mathcal{X}} \mathbb{E}[Y | X = x] Pr(X = x)\]
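The discrete version above amounts to a weighted sum, which a few lines of Python make concrete (all numbers here are made up for illustration: \(X \in \{0, 1\}\) with \(Pr(X = 1) = 0.4\), \(\mathbb{E}[Y|X=0] = 2\), \(\mathbb{E}[Y|X=1] = 5\)):

```python
# Law of total expectation with a discrete X:
# E[Y] = sum over x of E[Y | X = x] * Pr(X = x).
p_x = {0: 0.6, 1: 0.4}    # hypothetical marginal distribution of X
cef = {0: 2.0, 1: 5.0}    # hypothetical conditional expectations E[Y | X = x]

e_y = sum(cef[x] * p_x[x] for x in p_x)
print(e_y)  # 0.6*2 + 0.4*5 = 3.2
```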

Estimation

  • One critical use of statistical theory is understanding how to learn about things we don’t observe using things that we do observe. We call this estimation.
    • e.g. What is the share of voters in Wisconsin who will turn out in the 2026 election?
    • What is the share of voters who turn out among those assigned to receive a GOTV phone call?
  • Estimand: The unobserved quantity that we want to learn about. Often denoted via a greek letter (e.g. \(\mu\), \(\pi\))
    • Often a “population” characteristic that we want to learn about via a sample.
      • Although recall causal estimands can’t be fully observed even in a finite sample!
    • Important to define your estimand well. (Lundberg, Johnson and Stewart, 2022)

Estimation

  • Estimator: The function of random variables that we will use to try to estimate the quantity of interest. Often denoted with a hat on the parameter of interest (e.g. \(\hat{\mu}\), \(\hat{\pi}\))
    • Why are the variables random?
      • Classic inference: We have a random sample from the population – if we took another sample, we would obtain a different realization of our estimator.
      • Randomization inference: We have a randomly assigned treatment – if we were to re-run the experiment, we would observe a different treatment/control allocation.
  • Estimate: A single realization of our estimator (e.g. 0.3, 9.535)
    • We often report both point estimates (“best guess”) and interval estimates (e.g. confidence intervals).
    • Careful not to confuse properties of estimators with properties of the estimates themselves.

Estimation

  • The classic estimation problem in statistics is to estimate some unknown population mean \(\mu\) from an i.i.d. sample of \(n\) observations \(Y_1, Y_2, \dotsc, Y_n\).
    • We assume that each \(Y_i\) is a draw from the target population with mean \(\mu\) (identically distributed) – therefore \(\mathbb{E}[Y_i] = \mu\)
    • We’ll also assume that knowing \(Y_i\) tells us nothing about any other \(Y_j\): \(Y_i \perp \!\!\! \perp Y_j\) (independently distributed) – this implies \(Cov(Y_i, Y_j) = 0\)
  • Our estimand: \(\mu\)
  • Our estimator: The sample mean \(\hat{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)
  • Our estimate: A particular realization of that estimator based on our observed sample (e.g. \(0.4\))

Estimation

  • Note that our estimator is a random variable – it’s a function of \(Y_i\)s which are random variables.
    • Therefore it has an expectation \(\mathbb{E}[\hat{\mu}]\) (assuming \(Y_i\) has an expectation)
    • It has a variance \(Var(\hat{\mu})\) (again, under regularity conditions)
    • It has a distribution (which we may or may not know).

Estimation

  • How do we know if we’ve picked a good estimator? Will it be close to the truth? Will it be systematically higher or lower than the target?

  • We want to derive some of its properties

    • Bias: \(\mathbb{E}[\hat{\mu}] - \mu\)
    • Variance: \(Var(\hat{\mu})\)
    • Consistency: Does \(\hat{\mu}\) converge in probability to \(\mu\) as \(n\) goes to infinity?
    • Asymptotic distribution: Is the sampling distribution of \(\hat{\mu}\) well approximated by a known distribution?

Unbiasedness

  • Is the expectation of \(\hat{\mu}\) equal to \(\mu\)?

    \[\mathbb{E}[\hat{\mu}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n Y_i\right]\]

  • Next we use linearity of expectations

    \[\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right]\]

  • Finally, under our i.i.d. assumption

    \[\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n \mu}{n} = \mu\]

  • Therefore, the bias, \(\text{Bias}(\hat{\mu}) = \mathbb{E}[\hat{\mu}] - \mu = 0\)
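Unbiasedness is a statement about the sampling distribution, so we can illustrate it by simulating many repeated samples and averaging the resulting estimates. A Python sketch with the arbitrary choices \(\mu = 1.5\), \(\sigma = 2\), \(n = 50\):

```python
import random

# Simulate the sampling distribution of the sample mean:
# average many realizations of mu_hat and compare to the true mu.
random.seed(3)
mu, n, reps = 1.5, 50, 20_000
estimates = []
for _ in range(reps):
    sample = [random.gauss(mu, 2) for _ in range(n)]  # one i.i.d. sample
    estimates.append(sum(sample) / n)                 # one realization of mu_hat

mean_of_estimates = sum(estimates) / reps
print(round(mean_of_estimates, 2))  # near mu = 1.5: no systematic bias
```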

Variance

  • What is the variance of \(\hat{\mu}\)? Again, start by pulling out the constant.

    \[Var(\hat{\mu}) = Var\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right]\]

  • We can further simplify using independence: the variance of a sum of independent random variables is the sum of the variances.

    \[\frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right]\]

  • Finally, since the \(Y_i\) are identically distributed with variance \(\sigma^2\)

    \[\frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]

  • Therefore, the variance is \(\frac{\sigma^2}{n}\)
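The \(\sigma^2/n\) result can also be checked by simulating the sampling distribution directly. A Python sketch with the hypothetical choices \(Y_i \sim \mathcal{N}(0, 4)\) and \(n = 100\), so \(Var(\hat{\mu}) = 4/100 = 0.04\):

```python
import random

# Check Var(mu_hat) = sigma^2 / n via simulation.
random.seed(4)
sigma, n, reps = 2.0, 100, 20_000
estimates = []
for _ in range(reps):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    estimates.append(sum(sample) / n)  # one realization of mu_hat

m = sum(estimates) / reps
var_hat = sum((e - m) ** 2 for e in estimates) / (reps - 1)
print(round(var_hat, 3))  # near sigma^2 / n = 0.04
```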

Asymptotic behavior

  • As \(n\) gets large, what can we say about the estimator \(\hat{\mu}\)?

  • First, we can show that it is consistent – it converges in probability to the true parameter \(\mu\)

    • Unbiasedness + Variance that goes to \(0\) as \(n\) gets large.
    • Some estimators may be biased but have bias terms that go to \(0\) – if variance also goes to \(0\) these are still consistent.
  • Second, we can say something about the distribution of \(\hat{\mu}\).

    • Remember, we’ve only made assumptions about \(\mathbb{E}[Y_i]\) and \(Var(Y_i)\) (that they exist). We have made no assumptions on the distribution of \(Y_i\). \(Y_i\) can be normal, Poisson, Bernoulli, or whatever!
    • However, we know something about sums and means of random variables – they are well-approximated by a normal distribution. The Central Limit Theorem!
    • So in large samples, the sampling distribution of \(\hat{\mu}\) is close to normal. This lets us construct confidence intervals and do inference with this approximation and be confident that we won’t be far off!
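The CLT is easy to see in simulation: take a decidedly non-normal \(Y_i\) and check that a normal-approximation confidence interval covers at about its nominal rate. A Python sketch with the arbitrary choices \(Y_i \sim \text{Bernoulli}(0.2)\) and \(n = 400\):

```python
import random

# CLT illustration: Y_i is Bernoulli(0.2) (nothing like a normal), yet the
# sampling distribution of the mean is approximately N(p, p(1-p)/n).
random.seed(5)
p, n, reps = 0.2, 400, 20_000
means = []
for _ in range(reps):
    means.append(sum(1 if random.random() < p else 0 for _ in range(n)) / n)

# Under the normal approximation, ~95% of means fall within 1.96 SEs of p.
se = (p * (1 - p) / n) ** 0.5
coverage = sum(abs(m - p) <= 1.96 * se for m in means) / reps
print(round(coverage, 2))  # close to 0.95
```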

Regression review

  • Rather than just estimating a population mean, we are more typically interested in some population conditional expectation \(\mathbb{E}[Y|X]\)
    • \(Y_i\): Outcome/response/dependent variable
    • \(X_i\): Vector of regressor/independent variables
  • “How does the expected value of \(Y\) differ across different values of \(X\)?”
  • Suppose we observe \(N\) paired observations of \(\{Y_i, X_i\}\).
    • How do we construct a “good” estimator of \(\mathbb{E}[Y|X]\)?
    • What assumptions do we have to make to get…consistency…unbiasedness…efficiency?

Regression review

  • Consider the ordinary least squares estimator \(\hat{\beta}\) which solves the minimization problem:

    \[\hat{\beta} = \argmin_b \ \sum_{i=1}^N (Y_i - X_ib)^2\]

  • We can do some algebra and find a closed form solution for this optimization problem

    \[\hat{\beta} = (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\]
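The closed form maps directly to a few lines of code. A sketch using numpy (an assumption; the course software is not specified here) on simulated data with hypothetical true coefficients \(\beta = (1, 2)\), checked against numpy's built-in least-squares solver:

```python
import numpy as np

# Compute the OLS closed form (X'X)^{-1} X'Y on simulated data.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 regressor
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations (X'X) b = X'Y rather than inverting explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)     # numpy's own solver

print(np.round(beta_hat, 2))  # close to [1, 2]
```

Solving the normal equations (rather than forming \((\mathbf{X}^{\prime}\mathbf{X})^{-1}\) explicitly) is the numerically preferred way to evaluate the closed form.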

Regression review

  • Assumption 1: Linearity

    \[Y = \mathbf{X}\beta + \epsilon\]

  • Assumption 2: Strict exogeneity of the errors

    \[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]

  • These two imply:

    • Linear CEF

    \[\mathbb{E}[Y|\mathbf{X}] = \mathbf{X}\beta = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \dotsc + \beta_kX_{k}\]

  • Best case: Our CEF is truly linear (by luck or we have a saturated model)

  • Usual case: We’re at least consistent for the best linear approximation to the CEF

Regression review

  • Assumption 3: No perfect collinearity

    • \(\mathbf{X}^{\prime}\mathbf{X}\) is invertible
    • \(\mathbf{X}\) has full column rank
  • This assumption is needed for identifiability – otherwise no unique solution to the least squares minimization problem exists!

  • Fails when one column can be written as a linear combination of the others

    • Or when there are more regressors than observations \(k > n\)
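A rank check makes the failure concrete. A numpy sketch (numpy is an assumption here) where the third column is constructed as an exact linear combination of the intercept and the first regressor:

```python
import numpy as np

# Perfect collinearity: one column of X is a linear combination of the others,
# so X does not have full column rank and X'X is singular.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1 - 3])  # col3 = 2*col2 - 3*col1

rank = np.linalg.matrix_rank(X)
print(rank)  # 2, but X has 3 columns: rank deficient
```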

Regression review

  • Under assumptions 1-3, our OLS estimator \(\hat{\beta}\) is unbiased and consistent for \(\beta\)
  • Let’s do a quick proof for unbiasedness

\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]

  • Then we can obtain the conditional expectation of \(\mathbb{E}[\hat{\beta} | \mathbf{X}]\)

\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]

Regression review

  • Lastly, by law of total expectation

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]

  • Therefore

    \[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]

  • Consistency requires us to show the convergence of \((\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\) to \(0\) in probability as \(N \to \infty\).

    • This actually requires weaker assumptions: \(\mathbb{E}[\mathbf{X}^{\prime}\epsilon] = 0\) but not necessarily \(\mathbb{E}[\epsilon | \mathbf{X}] = 0\).
  • But what have we not assumed?

    • Anything about the distribution of the errors!

Regression review

  • Assumption 4 - Spherical errors

\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]

  • Benefits
    • Simple, unbiased estimator for the variance of \(\hat{\beta}\)
    • Completes Gauss-Markov assumptions \(\leadsto\) OLS is BLUE (Best Linear Unbiased Estimator)
  • Drawbacks
    • Basically never is true

Regression review

  • Good news! We can relax homoskedasticity (but still keep no correlation) and do inference on the variance of \(\hat{\beta}\)

    \[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2_1 & 0 & \cdots & 0\\ 0 & \sigma^2_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_n \end{bmatrix}\]

  • “Robust” standard errors using the Eicker-Huber-White “sandwich” estimator – consistent but not unbiased for the true sampling variance of \(\hat{\beta}\)

    \[\widehat{Var(\hat{\beta})} = (\mathbf{X}^{\prime}\mathbf{X})^{-1} \mathbf{X}^{\prime}\hat{\Sigma}\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\]

    • \(\hat{\Sigma}\) is our estimate of the variance-covariance matrix using the squared residuals on the diagonals
    • Extensions to “clustered” standard errors that allow arbitrary correlation within groups.
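A minimal numpy sketch of the HC0 version of the sandwich estimator (the simplest variant, with raw squared residuals on the diagonal of \(\hat{\Sigma}\); applied work typically uses small-sample corrections like HC1-HC3). The data-generating process here is made up, with errors whose variance grows with \(|x|\):

```python
import numpy as np

# HC0 "sandwich" variance estimator:
# (X'X)^{-1} X' Sigma_hat X (X'X)^{-1}, Sigma_hat = diag(residuals^2).
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Heteroskedastic errors: error standard deviation grows with |x|.
Y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat

meat = X.T @ (X * resid[:, None] ** 2)   # X' diag(e_i^2) X, without forming diag
vcov_hc0 = XtX_inv @ meat @ XtX_inv      # the "sandwich"
robust_se = np.sqrt(np.diag(vcov_hc0))
print(np.round(robust_se, 3))
```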

Regression review

  • Assumption 5 - Normality of the errors

\[\epsilon | \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\]

  • Not necessary even for Gauss-Markov assumptions
  • Not needed to do asymptotic inference on \(\hat{\beta}\)
    • Why? Central Limit Theorem!
  • Benefits?
    • Finite-sample inference.

Regression review

  • What do we need for OLS to be consistent for the “best linear approximation” to the CEF?
    • Very little!
  • What do we need for OLS to be consistent and unbiased for the conditional expectation function?
    • Truly linear CEF
    • But still no assumptions about the outcome distribution!
  • What do we need to do inference on \(\hat{\beta}\)?
    • We almost never assume homoskedasticity because “robust” SE estimators are ubiquitous
    • Even some forms of error correlation are permitted (“cluster” robust SEs)
    • Sample sizes are usually large enough where Central Limit Theorem implies a normal sampling distribution is a reasonable approximation.